{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 10.2  XGBoost算法案例实战1 - 金融反欺诈模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.2.1 案例背景**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "信用卡盗刷一般发生在持卡人信息被不法分子窃取后复制卡片进行消费或信用卡被他人冒领后激活消费的情况。一旦发生信用卡盗刷，持卡人和银行都会蒙受一定的经济损失。因此，通过大数据搭建金融反欺诈模型对银行来说尤为重要。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.2.2 模型搭建**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.565793Z",
     "start_time": "2020-11-21T02:10:26.408702Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>换设备次数</th>\n",
       "      <th>支付失败次数</th>\n",
       "      <th>换IP次数</th>\n",
       "      <th>换IP国次数</th>\n",
       "      <th>交易金额</th>\n",
       "      <th>欺诈标签</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>11</td>\n",
       "      <td>3</td>\n",
       "      <td>5</td>\n",
       "      <td>28836</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>21966</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>18199</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>24803</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>7</td>\n",
       "      <td>10</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>26277</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   换设备次数  支付失败次数  换IP次数  换IP国次数   交易金额  欺诈标签\n",
       "0      0      11      3       5  28836     1\n",
       "1      5       6      1       4  21966     1\n",
       "2      6       2      0       0  18199     1\n",
       "3      5       8      2       2  24803     1\n",
       "4      7      10      5       0  26277     1"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_excel('信用卡交易数据.xlsx')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2.提取特征变量和目标变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.573772Z",
     "start_time": "2020-11-21T02:10:26.567788Z"
    }
   },
   "outputs": [],
   "source": [
    "# 通过如下代码将特征变量和目标变量单独提取出来，代码如下：\n",
    "X = df.drop(columns='欺诈标签') \n",
    "y = df['欺诈标签']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3.划分训练集和测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.588735Z",
     "start_time": "2020-11-21T02:10:26.577764Z"
    }
   },
   "outputs": [],
   "source": [
    "# 提取完特征变量后，通过如下代码将数据拆分为训练集及测试集：\n",
    "from sklearn.model_selection import train_test_split\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4.模型训练及搭建"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.649569Z",
     "start_time": "2020-11-21T02:10:26.591725Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "              importance_type='gain', interaction_constraints='',\n",
       "              learning_rate=0.05, max_delta_step=0, max_depth=6,\n",
       "              min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,\n",
       "              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "              tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 划分为训练集和测试集之后，就可以引入XGBoost分类器进行模型训练了，代码如下：\n",
    "from xgboost import XGBClassifier\n",
    "clf = XGBClassifier(n_estimators=100, learning_rate=0.05)\n",
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.2.3 模型预测及评估**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.663532Z",
     "start_time": "2020-11-21T02:10:26.652563Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,\n",
       "       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,\n",
       "       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,\n",
       "       0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,\n",
       "       0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,\n",
       "       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,\n",
       "       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,\n",
       "       1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,\n",
       "       1, 1], dtype=int64)"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 模型搭建完毕后，通过如下代码预测测试集数据：\n",
    "y_pred = clf.predict(X_test)\n",
    "\n",
    "y_pred  # 打印预测结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.678492Z",
     "start_time": "2020-11-21T02:10:26.665527Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>预测值</th>\n",
       "      <th>实际值</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   预测值  实际值\n",
       "0    0    1\n",
       "1    1    1\n",
       "2    1    1\n",
       "3    0    0\n",
       "4    0    1"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过和之前章节类似的代码，我们可以将预测值和实际值进行对比：\n",
    "a = pd.DataFrame()  # 创建一个空DataFrame \n",
    "a['预测值'] = list(y_pred)\n",
    "a['实际值'] = list(y_test)\n",
    "a.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.691458Z",
     "start_time": "2020-11-21T02:10:26.681485Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.875"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 可以看到此时前五项的预测准确度为60%，如果想看所有测试集数据的预测准确度，可以使用如下代码：\n",
    "from sklearn.metrics import accuracy_score\n",
    "score = accuracy_score(y_pred, y_test)\n",
    "score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.759275Z",
     "start_time": "2020-11-21T02:10:26.708413Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.875"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们还可以通过XGBClassifier()自带的score()函数来查看模型预测的准确度评分，代码如下，获得的结果同样是0.875。\n",
    "clf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.783212Z",
     "start_time": "2020-11-21T02:10:26.769249Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.8265032  0.1734968 ]\n",
      " [0.02098632 0.9790137 ]\n",
      " [0.0084281  0.9915719 ]\n",
      " [0.8999369  0.1000631 ]\n",
      " [0.8290514  0.17094862]]\n"
     ]
    }
   ],
   "source": [
    "# XGBClassifier分类器本质预测的并不是准确的0或1的分类，而是预测其属于某一分类的概率，可以通过predict_proba()函数查看预测属于各个分类的概率，代码如下：\n",
    "y_pred_proba = clf.predict_proba(X_test)\n",
    "print(y_pred_proba[0:5])  # 查看前5个预测的概率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:26.812135Z",
     "start_time": "2020-11-21T02:10:26.791192Z"
    }
   },
   "outputs": [],
   "source": [
    "# 此时的y_pred_proba是个二维数组，其中第一列为分类为0（也即非欺诈）的概率，第二列为分类为1（也即欺诈）的概率，因此如果想查看欺诈（分类为1）的概率，可采用如下代码：\n",
    "# y_pred_proba[:,1]  # 分类为1的概率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:28.228819Z",
     "start_time": "2020-11-21T02:10:26.816123Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAO6UlEQVR4nO3df6jdd33H8edriYWJ1oq5Sk2aJRtRF8GKXls35lYnzqRDgiCsVSwrSlZmZX+2DGb/8B/Ff0SshlBCETYzmMXGES3C0A66alKotWmp3EVsryk0VVGof5S07/1xb+Xs5Nx7vsn9nnPv+ZznAy6c7/f7uee+P9zLK+98zvdHqgpJ0uz7g80uQJLUDwNdkhphoEtSIwx0SWqEgS5Jjdi+WT94x44dtWfPns368ZI0kx555JHnq2ph1LFNC/Q9e/Zw+vTpzfrxkjSTkvx8rWMuuUhSIwx0SWqEgS5JjTDQJakRBrokNWJsoCc5luS5JI+vcTxJvpxkKcljSd7Vf5mSpHG6dOj3AgfWOX4Q2Lf6dRj42sbLkiRdqrHnoVfVg0n2rDPkEPD1WrkP78NJrkpydVU921ONkrSp/u2HT3P/o7/o7f32v/lK7vrw23t7v1f0sYa+E3hmYHt5dd9FkhxOcjrJ6fPnz/fwoyVp8u5/9Bc88exvN7uMsfq4UjQj9o18akZVHQWOAiwuLvpkDUkzY//VV/Lv//Bnm13GuvoI9GXgmoHtXcC5Ht5Xki5J30sjr3ji2d+y/+ore3/fvvWx5HICuGX1bJf3Ar9x/VzSZpjU0sj+q6/k0DtHriRvKWM79CTfAG4AdiRZBu4CXgVQVUeAk8CNwBLwO+DWSRUrSePMwtLIpHQ5y+XmMccL+HRvFUmSLsum3T5XkoZtdA18Vta6J8VL/yVtGRtdA5+Vte5JsUOXtKXM8xr4RtmhS1IjDHRJaoRLLpI23Ssfhs77h5obZYcuadMNhvk8f6i5UXbokjqb9KX1fhi6MXbokjqb90vrtzo7dEkjjerG7aS3Njt0SSON6sbtpLc2O3Rpwia17jxpduOzxw5dmrBZedrNMLvx2WOHLk2Bna6mwUCXerLW0ooXy2haXHKRerLW0opLF5oWO3Rpg4YvW3dpRZvFDl3aIC9b11Zhh64mTfNUQTtzbRV26GrSNE8VtDPXVmGHrmbZNWveGOhqxuAyi6cKah655KJmDC6zuAyieWSHrpnknQCli9mhayZ5J0DpYnbompo+TyW0G5cuZoeuqenzVEK7celidui6iM+NlGaTHbou4nMjpdlkh66R7KSl2WOg6/eG7xooabZ0WnJJciDJU0mWktw54vjrknw7yY+TnElya/+latK8a6A028Z26Em2AXcDHwSWgVNJTlTVEwPDPg08UVUfTrIAPJXkX6vqxYlUrXVd7oeafmgpzbYuHfp1wFJVnV0N6OPAoaExBbw2SYDXAL8CLvRaqTq73A817cyl2dZlDX0n8MzA9jJw/dCYrwAngHPAa4G/q6qXh98oyWHgMMDu3bsvp965sNHTBu20pfnUpUPPiH01tP0h4FHgzcA7ga8kuehTtao6WlWLVbW4sLBwycXOi42eNminLc2nLh36MnDNwPYuVjrxQbcCn6+qApaS/Ax4G/CjXqpsmDeZktSXLh36KWBfkr1JrgBuYmV5ZdDTwAcAkrwJeCtwts9CW+VNpiT1ZWyHXlUXktwOPABsA45V1Zkkt60ePwJ8Drg3yU9YWaK5o6qen2DdTbEbl9SHThcWVdVJ4OTQviMDr88Bf9Nvae3yyTqSJsF7uWwCn6wjaRK89H9KRnXlLrNI6pMd+pTYlUuatOY69Endy3uj7MolTVpzHfqk7uW9UXblkiatuQ4dPA1Q0nxqrkOXpHlloEtSI5pZcvFpO5LmXTMduk/bkTTvZq5DX+u0RE8LlDTvZq5DX+u0RDtzSfNu5jp08LRESRpl5jp0SdJoBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqRGdAj3JgSRPJVlKcucaY25I8miSM0l+0G+ZkqRxxj6CLsk24G7gg8AycCrJiap6YmDMVcBXgQNV9XSSN06qYEnSaF069OuApao6W1UvAseBQ0NjPgbcV1VPA1TVc/2WKUkap0ug7wSeGdheXt036C3A65N8P8kjSW4Z9UZJDic5neT0+fPnL69iSdJIXQI9I/bV0PZ24N3A3wIfAv4lyVsu+qaqo1W1WFWLCwsLl1ysJGltY9fQWenIrxnY3gWcGzHm+ap6AXghyYPAtcBPe6lSkjRWlw79FLAvyd4kVwA3ASeGxtwPvC/J9iSvBq4Hnuy3VEnSesZ26FV1IcntwAPANuBYVZ1Jctvq8SNV9WSS7wKPAS8D91TV45MsXJL0/3VZcqGqTgInh/YdGdr+IvDF/kqTJF0KrxSVpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRnQI9yYEkTyVZSnLnOuPek+SlJB/tr0RJUhdjAz3JNuBu4CCwH7g5yf41xn0BeKDvIiVJ43Xp0K8DlqrqbFW9CBwHDo0Y9xngm8BzPdYnSeqoS6DvBJ4Z2F5e3fd7SXYCHwGOrPdGSQ4nOZ3k9Pnz5y+1VknSOroEekbsq6HtLwF3VNVL671RVR2tqsWqWlxYWOhaoySpg+0dxiwD1wxs7wLODY1ZBI4nAdgB3JjkQlV9q5cqJUljdQn0U8C+JHuBXwA3AR8bHFBVe195neRe4D8Nc0marrGBXlUXktzOytkr24BjVXUmyW2rx9ddN5ckTUeXDp2qOgmcHNo3Msir6u83XpYk6VJ5pagkNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqRKdAT3IgyVNJlpLcOeL4x5M8tvr1UJJr+y9VkrSesYGeZBtwN3AQ2A/cnGT/0LCfAX9VVe8APgcc7btQSdL6unTo1wFLVXW2ql4EjgOHBgdU1UNV9evVzYeBXf2WKUkap0ug7wSeGdheXt23lk8C3xl1IMnhJKeTnD5//nz3KiVJY3UJ9IzYVyMHJu9nJdDvGHW8qo5W1WJVLS4sLHSvUpI01vYOY5aBawa2dwHnhgcleQdwD3Cwqn7ZT3mSpK66dOingH1J9ia5ArgJODE4IMlu4D7gE1X10/7LlCSNM7ZDr6oLSW4HHgC2Aceq6kyS21aPHwE+C7wB+GoSgAtVtTi5siVJw7osuVBVJ4GTQ/uODLz+FPCpfkuTJF0KrxSVpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRBrokNcJAl6RGGOiS1AgDXZIaYaBLUiMMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktQIA12SGmGgS1IjDHRJaoSBLkmNMNAlqREGuiQ1wkCXpEYY6JLUCANdkhphoEtSIwx0SWqEgS5JjTDQJakRnQI9yYEkTyVZSnLniONJ8uXV448leVf/pUqS1jM20JNsA+4GDgL7gZuT7B8adhDYt/p1GPhaz3VKksbo0qFfByxV1dmqehE4DhwaGnMI+HqteBi4KsnVPdcqSVrH9g5jdgLPDGwvA9d3GLMTeHZwUJLDrHTw7N69+1JrBWD/m6+8rO+TpNZ1CfSM2FeXMYaqOgocBVhcXLzoeBd3ffjtl/NtktS8Lksuy8A1A9u7gHOXMUaSNEFdAv0UsC/J3iRXADcBJ4bGnABuWT3b5b3Ab6rq2eE3kiRNztgll6q6kOR24AFgG3Csqs4kuW31+BHgJHAjsAT8Drh1ciVLkkbpsoZOVZ1kJbQH9x0ZeF3Ap/stTZJ0KbxSVJIaYaBLUiMMdElqhIEuSY3IyueZm/CDk/PAzy/z23cAz/dYzixwzvPBOc+Hjcz5j6pqYdSBTQv0jUhyuqoWN7uOaXLO88E5z4dJzdklF0lqhIEuSY2Y1UA/utkFbALnPB+c83yYyJxncg1dknSxWe3QJUlDDHRJasSWDvR5fDh1hzl/fHWujyV5KMm1m1Fnn8bNeWDce5K8lOSj06xvErrMOckNSR5NcibJD6ZdY986/G2/Lsm3k/x4dc4zfdfWJMeSPJfk8TWO959fVbUlv1i5Ve//An8MXAH8GNg/NOZG4DusPDHpvcAPN7vuKcz5z4HXr74+OA9zHhj3X6zc9fOjm133FH7PVwFPALtXt9+42XVPYc7/DHxh9fUC8Cvgis2ufQNz/kvgXcDjaxzvPb+2coc+jw+nHjvnqnqoqn69uvkwK0+HmmVdfs8AnwG+CTw3zeImpMucPwbcV1VPA1TVrM+7y5wLeG2SAK9hJdAvTLfM/lTVg6zMYS2959dWDvS1Hjx9qWNmyaXO55Os/As/y8bOOclO4CPAEdrQ5ff8FuD1Sb6f5JEkt0ytusnoMuevAH/KyuMrfwL8U1W9PJ3yNkXv+dXpARebpLeHU8+QzvNJ8n5WAv0vJlrR5HWZ85eAO6rqpZXmbeZ1mfN24N3AB4A/BP4nycNV9dNJFzchXeb8IeBR4K+BPwG+l+S/q+q3ky5uk/SeX1s50Ofx4dSd5pPkHcA9wMGq+uWUapuULnNeBI6vhvkO4MYkF6rqW9MpsXdd/7afr6oXgBeSPAhcC8xqoHeZ863A52tlgXkpyc+AtwE/mk6JU9d7fm3lJZd5fDj12Dkn2Q3cB3xihru1QWPnXFV7q2pPVe0B/gP4xxkOc+j2t30/8L4k25O8GrgeeHLKdfapy5yfZuV/JCR5E/BW4OxUq5yu3vNry3boNYcPp+44588CbwC+utqxXqgZvlNdxzk3pcucq+rJJN8FHgNeBu6pqpGnv82Cjr/nzwH3JvkJK8sRd1TVzN5WN8k3gBuAHUmWgbuAV8Hk8stL/yWpEVt5yUWSdAkMdElqhIEuSY0w0CWpEQa6JDXCQJekRhjoktSI/wNDwt954QNOTgAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 下面我们利用4.3节相关代码绘制ROC曲线来评估模型预测的效果：\n",
    "from sklearn.metrics import roc_curve\n",
    "fpr, tpr, thres = roc_curve(y_test, y_pred_proba[:,1])\n",
    "import matplotlib.pyplot as plt\n",
    "plt.plot(fpr, tpr)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:28.239760Z",
     "start_time": "2020-11-21T02:10:28.230785Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8684772657918828"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过如下代码求出模型的AUC值：\n",
    "from sklearn.metrics import roc_auc_score\n",
    "score = roc_auc_score(y_test, y_pred_proba[:,1])\n",
    "\n",
    "score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:28.266719Z",
     "start_time": "2020-11-21T02:10:28.241756Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.40674368, 0.1901846 , 0.04100984, 0.33347666, 0.02858528],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 我们可以通过查看各个特征的特征重要性(feature importance)来得出信用卡欺诈行为判断中最重要的特征变量：\n",
    "clf.feature_importances_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:28.304590Z",
     "start_time": "2020-11-21T02:10:28.270680Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>特征名称</th>\n",
       "      <th>特征重要性</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>换设备次数</td>\n",
       "      <td>0.406744</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>换IP国次数</td>\n",
       "      <td>0.333477</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>支付失败次数</td>\n",
       "      <td>0.190185</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>换IP次数</td>\n",
       "      <td>0.041010</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>交易金额</td>\n",
       "      <td>0.028585</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     特征名称     特征重要性\n",
       "0   换设备次数  0.406744\n",
       "3  换IP国次数  0.333477\n",
       "1  支付失败次数  0.190185\n",
       "2   换IP次数  0.041010\n",
       "4    交易金额  0.028585"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 通过如下5.2.2节特征重要性相关知识点进行整理，方便结果呈现，代码如下：\n",
    "features = X.columns  # 获取特征名称\n",
    "importances = clf.feature_importances_  # 获取特征重要性\n",
    "\n",
    "# 通过二维表格形式显示\n",
    "importances_df = pd.DataFrame()\n",
    "importances_df['特征名称'] = features\n",
    "importances_df['特征重要性'] = importances\n",
    "importances_df.sort_values('特征重要性', ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**10.2.4 模型参数调优**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:28.314589Z",
     "start_time": "2020-11-21T02:10:28.308577Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV  \n",
    "parameters = {'max_depth': [1, 3, 5], 'n_estimators': [50, 100, 150], 'learning_rate': [0.01, 0.05, 0.1, 0.2]}  # 指定模型中参数的范围\n",
    "clf = XGBClassifier()  # 构建模型\n",
    "grid_search = GridSearchCV(clf, parameters, scoring='roc_auc', cv=5)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:34.563017Z",
     "start_time": "2020-11-21T02:10:28.316589Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'learning_rate': 0.05, 'max_depth': 1, 'n_estimators': 100}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 下面我们将数据传入网格搜索模型并输出参数最优值：\n",
    "grid_search.fit(X_train, y_train)  # 传入数据\n",
    "grid_search.best_params_  # 输出参数的最优值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:34.601913Z",
     "start_time": "2020-11-21T02:10:34.565012Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,\n",
       "              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,\n",
       "              importance_type='gain', interaction_constraints='',\n",
       "              learning_rate=0.05, max_delta_step=0, max_depth=1,\n",
       "              min_child_weight=1, missing=nan, monotone_constraints='()',\n",
       "              n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,\n",
       "              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,\n",
       "              tree_method='exact', validate_parameters=1, verbosity=None)"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 下面我们根据新的参数建模，首先重新搭建XGBoost分类器，并将训练集数据传入其中：\n",
    "clf = XGBClassifier(max_depth=1, n_estimators=100, learning_rate=0.05)\n",
    "clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-21T02:10:34.613882Z",
     "start_time": "2020-11-21T02:10:34.603908Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8563218390804598\n"
     ]
    }
   ],
   "source": [
    "# 因为我们是通过ROC曲线的AUC评分作为模型评价准则来进行参数调优的，因此通过如下代码我们来查看新的AUC值：\n",
    "y_pred_proba = clf.predict_proba(X_test)\n",
    "from sklearn.metrics import roc_auc_score\n",
    "score = roc_auc_score(y_test, y_pred_proba[:,1])\n",
    "print(score)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "将获得的AUC评分打印出来为：0.856，比原来没有调参前的0.866还略微低了些，有的读者可能会奇怪为什么调参后的结果还不如未调参时的结果，通常来说参数调优出现这种情况的概率较小，这里出现了也正好给大家解释下出现这种情况的原因。\n",
    "\n",
    "出现这种情况的原因是因为交叉验证，我们来简单回顾下5.3节K折交叉验证的思路：它是将原来的测试数据分为K份（这里cv=5，即5份），然后在这K份数据中，选K-1份作为训练数据，剩下的1份作为测试数据，训练K次，获得K个的ROC曲线下的AUC值，然后将K个AUC值取平均，取AUC值的均值为最大情况下的参数为模型的最优参数。注意这里AUC值的获取是基于训练集数据，只不过是将训练集数据中的1/K作为测试集数据，这里的测试集数据并不是真正的测试集数据y_test，这也是为什么参数调优后结果反而不如不调优的结果的原因。实际应用中，通常不太会出现调参结果不如不调参的结果，出现这种情况某种程度也是因为数据量较小的原因（像本案例为1000条数据）。其他关于参数调优的其他一些注意点请参考5.3节，这里不再赘述。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
